GPU and high speed computing

prysm has simple, transparent operation on both CPU and GPU (or in fact any module with a numpy compatible API). With a single line, you can reconfigure the “backend” of its engine and perform computing on a GPU. Consider the following, which is done on a computer with an intel i7-9700K CPU and Nvidia GTX 2080 GPU:

[1]:
import numpy as np
import cupy as cp

from prysm import Pupil, PSF, MTF
from prysm import config
from prysm.mathops import engine
from prysm.coordinates import gridcache
from prysm.geometry import mcache
from prysm.zernike import zcachemn

A few functions used for some routines are not yet implemented in cupy, so an error will be generated with the ordinary config.backend = cp way of makign the change. We can still use a lower level mechanism, which avoids re-vectorizing the Jinc implementation used for analytical airy disks.

[2]:
# case 1, CPU, large scale simulation, fp64
config.precision = 64
[3]:
%timeit p = Pupil(samples=2048); ps = PSF.from_pupil(p, efl=1, Q=4); mt = MTF.from_psf(ps);
9.57 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

While the values here are not important, the amount of computational work needed is likely clear. The simulation takes quite a while, and if this were in an optimization loop (say, parameter iteration in design or phase retrieval), the performance is probably not satisfactory. We can reduce the numerical precision to speed thing up:

[4]:
gridcache.clear()
mcache.clear()
zcachemn.clear()
config.precision = 32
%timeit p = Pupil(samples=2048); ps = PSF.from_pupil(p, efl=1, Q=4); mt = MTF.from_psf(ps);
9.19 s ± 82.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On windows, this should be about twice as fast. On MacOS and Linux, it probably makes no difference. A few seconds is still quite a long time to wait, luckily we can go faster. Because we’re changing the backend at a lower level for now, we need to dump a few caches. Assigning to config.backend does this for us, but will error with the current version of cupy (6.2).

[5]:
gridcache.clear()
mcache.clear()
zcachemn.clear()
engine.source = cp

With these four lines (the first three of which are superfluous if we have never done anything on CPU in this python session), prysm will now use the GPU for all calculations. While the GPU may not be optimal for every single one, it is majoritatively superior. How much superior?

[6]:
# still fp32
%timeit p = Pupil(samples=2048); ps = PSF.from_pupil(p, efl=1, Q=4); mt = MTF.from_psf(ps);
95.6 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About 100 fold, “for free,” as long as you have the hardware! How about with double precision, which may be a firm requirement?

[7]:
gridcache.clear()
mcache.clear()
zcachemn.clear()
config.precision = 64
%timeit p = Pupil(samples=2048); ps = PSF.from_pupil(p, efl=1, Q=4); mt = MTF.from_psf(ps);
269 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It is somewhat common knowledge that GPUs perform worse with double precision, here is some evidence to that. We are still in the domain of a few dozen fold performance improvement, which is the difference between a full work day and an hour.

As a performance tip when using the GPU for tasks like phase retrieval, do everything on GPU and then move the cost function value back to the host (cpu) as a single double precision float and give that to the optimizer. Or, use a different backend than cupy which has its optimizers available on GPU (such as chainer, or other ML frameworks). You can make use of their autograd code for “free” jacobian calculation, too, by using their variable types as inputs to prysm. If you combine the autograd, which is relatively little work, with 32-bit calculation and a GPU backend, you can speed up your phase retrieval routine on the order of a thousand fold with little work. This brings the performance (timeliness) near real time, and enables phase retrieval for active alignment feedback when assembling systems. Food for thought.

prysm itself makes no controls (at all) over threading or cpu/gpu, you can manipulate the environment variables prior to importing prysm or numpy to configure multi-threading, MPI, or other similar mechanisms to make the CPU go faster. Most systems are actually memory bandwidth limited on these sorts of platforms, so that tends to only scale well on 4-or-higher memory channel systems, like the intel Xeon based nodes in most cluster computers, or AMD ThreadRipper and EPYC workstations.